ABSTRACT



A method and apparatus for securing access to a service or 
facility employing automatic speech recognition, text- 
independent speaker identification, natural language under- 
standing techniques and additional dynamic and static fea- 
tures. The method includes the steps of receiving and 
decoding speech containing indicia of the speaker such as a 
name, address or customer number; accessing a database 
containing information on candidate speakers; questioning 
the speaker based on the information; receiving, decoding 
and verifying an answer to the question; obtaining a voice 
sample of the speaker and verifying the voice sample against 
a model; generating a score based on the answer and the 
voice sample; and granting access if the score is equal to or 
greater than a threshold. Alternatively, the method includes 
the steps of receiving and decoding speech containing 
indicia of the speaker; generating a sub-list of speaker 
candidates having indicia substantially matching the 
speaker; activating databases containing information about 
the speaker candidates in the sub-list; performing voice 
classification analysis; eliminating speaker candidates based
on the voice classification analysis; questioning the speaker
regarding the information; eliminating speaker candidates
based on the answer; and iteratively repeating prior steps
until one speaker candidate (in which case the speaker is 
granted access), or no speaker candidate remains (in which 
case the speaker is not granted access). 

35 Claims, 5 Drawing Sheets

APPARATUS AND METHODS FOR SPEAKER
VERIFICATION/IDENTIFICATION/ 
CLASSIFICATION EMPLOYING NON- 
ACOUSTIC AND/OR ACOUSTIC MODELS 
AND DATABASES 

RELATED APPLICATION DATA 

This application is related to the application (Docket 
Number Y0997-136) entitled "Portable Acoustic Interface 
For Remote Access To Automatic Speech/Speaker Recog- 
nition Server", which is commonly assigned and is filed 
concurrently with the present invention. 

BACKGROUND OF THE INVENTION 

The present invention relates to methods and apparatus 
for providing security with respect to user access of services 
and/or facilities and, more particularly, to methods and 
apparatus for providing same employing automatic speech 
recognition, text-independent speaker identification, natural 
language understanding techniques and additional dynamic 
and static features. 

In many instances, it is necessary to verify that an 
individual requesting access to a service or a facility is in 
fact authorized to access the service or facility. For example, 
such services may include banking services, telephone 
services, or home video provision services, while the facili- 
ties may be, for example, banks, computer systems, or 
database systems. In such situations, users typically have to 
write down, type or key in (e.g., on a keyboard) certain 
information in order to send an order, make a request, obtain 
a service, perform a transaction or transmit a message. 

Verification or authentication of a customer prior to 
obtaining access to such services or facilities typically relies 
essentially on the customer's knowledge of passwords or 
personal identification numbers (PINs) or by the customer 
interfacing with a remote operator who verifies the custom- 
er's knowledge of information such as name, address, social 
security number, city or date of birth, mother's maiden 
name, etc. In some special transactions, handwriting recog- 
nition or signature verification is also used. 

However, such conventional user verification techniques 
present many drawbacks. First, information typically used to 
verify a user's identity may be easily obtained. Any perpetrator
who is reasonably prepared to commit fraud usually
finds it easy to obtain personal information such as the
social security number, mother's maiden name or date of
birth of his intended target. Regarding security measures for
more complex knowledge-based systems which require 
passwords, PINs or knowledge of the last transaction/ 
message provided during the previous service, such mea- 
sures are also not reliable mainly because the user is usually 
unable to remember this information or because many users 
write the information down thus making the fraudulent 
perpetrator's job even easier. For instance, it is known that
many unwitting users actually write their PINs on the
back of their ATM or smart card.

The shortcomings inherent in the above-discussed
security measures have prompted an increasing interest in
biometric security technology, i.e., verifying a person's
identity by personal biological characteristics. Several
biometric approaches are known. However, one disadvantage
of biometric approaches, with the exception of voice-based 
verification, is that they are expensive and cumbersome to 
implement. This is particularly true for security measures 
involved in remote transactions, such as internet-based or
telephone-based transaction systems.

Voice-based verification systems are especially useful
when it is necessary to identify a user who is requesting
telephone access to a service/facility but whose telephone is
not equipped with the particular pushbutton capability that
would allow him to electronically send his identification
password. Such existing systems which employ voice-based
verification utilize only the acoustic characteristics of the
utterances spoken by the user. As a result, existing voice
identification methods, e.g., such as is disclosed in the
article: S. Furui, "An Overview of Speaker Recognition",
Automatic Speech and Speaker Recognition, Advanced
Topics, Kluwer Academic Publisher, edited by C. Lee, F.
Soong and K. Paliwal, cannot guarantee a reasonably accurate
or fast identification particularly when the user is calling
from a noisy environment or when the user must be identified
from among a very large database of speakers (e.g.,
several million voters). Further, such existing systems are
often unable to attain the level of security expected by most
service providers. Still further, even when existing voice
verification techniques are applied under constrained
conditions, whenever the constraints are modified as is
required from time to time, verification accuracy becomes
unpredictable. Indeed, at the stage of development of the
prior art, it is clear that the understanding of the properties
of voice prints over large populations, especially over
telephone (i.e., land or cellular, analog or digital, with or without
speakerphones, with or without background noise, etc.), is
not fully mastered.

Furthermore, most of the existing voice verification systems
are text-dependent or text-prompted, which means that
the system knows the script of the utterance repeated by the
user once the identity claim is made. In fact, in some systems,
the identity claim is often itself part of the tested utterance;
however, this does not change in any significant way the
limitations of the conventional approaches. For example, a
text-dependent system cannot prevent an intruder from using
a pre-recorded tape with a particular speaker's answers
recorded thereon in order to breach the system.

Text-independent speaker recognition, as the technology
used in the embodiments presented in the disclosure of U.S.
Ser. No. 08/788,471, overcomes many disadvantages of the
text-dependent speaker recognition approach discussed
above. But there are still several issues which exist with
respect to text-independent speaker recognition, in and of
itself. In many applications, text-independent speaker
recognition requires a fast and accurate identification of the
identity of a user from among a large number of other
prospective users. This problem is especially acute when
thousands of users must be processed simultaneously within
a short time period and their identities have to be verified
from a database that stores millions of users' prototype
voices.

In order to restrict the number of prospective users to be
considered by a speech recognition device and to speed up
the recognition process, it has been suggested to use a "fast
match" technique on a speaker, as disclosed in the patent
application (U.S. Ser. No. 08/851,982) entitled, "Speaker
Recognition Over Large Population with Combined Fast and
Detailed Matches", filed on May 6, 1997. While this procedure
is significantly faster than a "detailed match" speaker
recognition technique, it still requires processing of acoustic
prototypes for each user in a database. Such a procedure can
still be relatively time consuming and may generate a
list of candidate speakers that is too extensive to be
processed by the recognition device.

Accordingly, among other things, it would be advantageous
to utilize a language model factor similar to what is
used in a speech recognition environment, such factor serving
to significantly reduce the size of fast match lists and
speed up the procedure for selecting candidate speakers
[...] from a database. By way of analogy, a fast match
technique employed in the speech recognition environment
is disclosed in the article by L. R. Bahl et al., "A Fast
Approximate Acoustic Match for Large Vocabulary Speech
Recognition", IEEE Trans. Speech and Audio Proc., Vol. 1,
59-67 (1993).

SUMMARY OF THE INVENTION

It is an object of the present invention to provide methods
and apparatus for providing secure access to services and/or
facilities which preferably utilize random questioning,
automatic speech recognition (ASR) and text-independent
speaker recognition techniques. Also, indicia contained in
spoken utterances provided by the speaker may serve as
additional information about the speaker which may be used
throughout a variety of steps of the invention.

In one aspect of the present invention, a method of
controlling access of a speaker to one of a service and a
facility comprises the steps of: (a) receiving first spoken
utterances of the speaker, the first spoken utterances
containing indicia of the speaker; (b) decoding the first spoken
utterances; (c) accessing a database corresponding to the
decoded first spoken utterances, the database containing
information attributable to a speaker candidate having indicia
substantially similar to the speaker; (d) querying the
speaker with at least one random (but questions could be
non-random) question (but preferably more than one random
question) based on the information contained in the accessed
database; (e) receiving second spoken utterances of the
speaker, the second spoken utterances being representative
of at least one answer to the at least one random question;
(f) decoding the second spoken utterances; (g) verifying the
accuracy of the decoded answer against the information
contained in the accessed database serving as the basis for
the question; (h) taking a voice sample from the utterances
of the speaker and processing the voice sample against an
acoustic model attributable to the speaker candidate; (i)
generating a score corresponding to the accuracy of the
decoded answer and the closeness of the match between the
voice sample and the model; and (j) comparing the score to
a predetermined threshold value and if the score is one of
substantially equivalent to and above the threshold value,
then permitting speaker access to one of the service and the
facility. If the score does not fall within the above preferred
range, then access may be denied to the speaker, the process
may be repeated in order to obtain a new score, or a system
provider may decide on another appropriate course of action.

In a first embodiment, the indicia may include identifying
indicia, such as a name, address, customer number, etc.,
from which the identity claim may be made. However, in
another embodiment, the identity claim may have already
been made by the potential user keying in (or card swiping)
a customer number or social security number, for example,
in which case the indicia includes verifying indicia in order
to aid in the verification of the identity claim. Also, the
indicia may serve as additional information about the user
which may serve as static and/or dynamic parameters in
building or updating the user's acoustic model.

In another aspect of the present invention, a method of
controlling access of a speaker to one of a service and a
facility from among a multiplicity of speaker candidates
comprises the steps of: (a) receiving first spoken utterances
of the speaker, the first spoken utterances containing indicia
of the speaker; (b) decoding the first spoken utterances; (c)
generating a sub-list of speaker candidates that substantially
match the speaker's decoded spoken utterances; (d) activating
databases respectively corresponding to the speaker
candidates in the sub-list, the databases containing information
respectively attributable to the speaker candidates; (e)
performing a voice classification analysis on voice
characteristics of the speaker; (f) eliminating speaker candidates
who do not substantially match these characteristics; (g)
querying the speaker with at least one question that is
relevant to the information in the databases of speaker
candidates remaining after the step (f); (h) further eliminating
speaker candidates based on the accuracy of the answer
provided by the speaker in response to the at least one
question; (i) further performing the voice classification
analysis on the voice characteristics from the answer
provided by the speaker; (j) still further eliminating speaker
candidates who do not substantially match these
characteristics; and (k) iteratively repeating steps (g) through (j) until
one of one speaker candidate and no speaker candidates
remain, if one speaker candidate remains then permitting the
speaker access and if no speaker candidate remains then
denying the speaker access. Of course, it is possible to repeat
the entire process if no speaker candidate is chosen or a
system provider may choose another appropriate course of
action.

Again, as mentioned above, the method of the invention
may be used for identification and/or verification without
any explicit identification given by the user (e.g., name). By
checking the type of request made by the user, using
additional information, if provided, and by using the acoustic
match, discussed above, user identification may be established.
Further, by using the random questions in addition to
the acoustic identification, a more accurate identification is
achieved in almost any type of environment.

It is therefore an object of the invention to provide
apparatus and methods which: use external information to
build user's models; extract non-feature-based information
from the acoustic properties of the speech to build user's
models; extract non-acoustic information from the speech to
build user's models; drive the conversations to request
specific information; decode and understand the answers
to these questions; compare the answers to information
stored in a database; and build user's models on answers to
the questions.

The resulting system is a combination of technology:
text-independent speaker recognition, speech recognition
and natural language understanding. It is also possible to add
new questions, decode and understand the answer and add
the question to the pool of random questions for the next
access request by the same user.

It is also to be appreciated that the methods and apparatus
described herein use voice prints (speaker recognition),
speech recognition, natural language understanding, acoustic
and content analysis to build a new biometric. Such a
speech biometric contains acoustic information, semantic
information, static and dynamic information, as will be
explained, and is also a knowledge based system. However,
while the invention utilizes knowledge known by the user
and knowledge acquired by the speech recognition engine
(e.g., speech rate, accent, preferred vocabulary, preferred
requests), the combination thereof provides advantages
much greater than the advantages respectively associated
with each individual aspect. Such a formation of this unique
speech biometric including voice prints and knowledge
based systems has, prior to this invention, been unknown
since the two concepts have previously been considered
substantially mutually exclusive concepts.
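
By way of illustration only, the iterative elimination of steps (e) through (k) of the second method summarized above might be sketched in Python as follows. The names Candidate, identify and ask, the string voice-class labels, and the use of exact string comparison in place of semantic analysis and voice classification are all assumptions introduced for this sketch and are not taken from the disclosure. Questions are picked in sorted order here only to keep the sketch deterministic, and denying access when the questions run out with several candidates remaining is likewise an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Candidate:
    """Hypothetical record for one speaker candidate in the sub-list."""
    name: str
    voice_class: str       # assumed coarse label produced by a voice classifier
    facts: Dict[str, str]  # question -> expected answer from the candidate's database


def identify(
    sub_list: List[Candidate],
    initial_voice_class: str,
    ask: Callable[[str], Tuple[str, str]],
) -> Optional[Candidate]:
    """Return the single surviving candidate, or None if access should be denied.

    `ask` is a hypothetical callback that puts a question to the speaker and
    returns (decoded_answer, voice_class_of_the_answer).
    """
    # Steps (e)-(f): drop candidates whose voice class does not match.
    remaining = [c for c in sub_list if c.voice_class == initial_voice_class]
    asked = set()

    # Step (k): repeat steps (g)-(j) until one or no candidate remains.
    while len(remaining) > 1:
        pending = sorted({q for c in remaining for q in c.facts} - asked)
        if not pending:
            break  # no questions left to separate the candidates (assumed: deny)
        question = pending[0]
        asked.add(question)

        # Step (g): query the speaker; steps (h)-(j): eliminate on the
        # answer and on the voice class observed in the answer.
        answer, voice_class = ask(question)
        remaining = [
            c for c in remaining
            if c.facts.get(question) == answer and c.voice_class == voice_class
        ]

    return remaining[0] if len(remaining) == 1 else None
```

In such a sketch, two candidates with similar names but different stored answers are separated by the first question whose expected answers differ, which mirrors the narrowing of the sub-list described in steps (g) through (j).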

The overall system provides a security level with an
arbitrary level of security with speech and speaker recognition
technology and natural language understanding. This
global architecture has the advantage of being universal and
adaptable to substantially any situation. The complete
transaction is monitored so that possible problems can be
detected in using this large amount of data and flags are
raised for further processing for action by the service
provider.

These and other objects, features and advantages of the
present invention will become apparent from the following
detailed description of illustrative embodiments thereof,
which is to be read in connection with the accompanying
drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a flow chart/block diagram illustrating the
functional interconnection between components of the
invention;

FIG. 2 is a flow chart/block diagram further illustrating
components of the invention;

FIG. 3 is a flow chart/block diagram illustrating an
iterative procedure performed according to the invention;

FIG. 4 is a block diagram illustrating a user database
according to the invention; and

FIG. 5 is a flow chart/block diagram illustrating the
generation of a user model according to the invention.

DETAILED DESCRIPTION OF PREFERRED
EMBODIMENTS

Referring initially to FIG. 1, a flow chart/block diagram of
the basic components of the invention is shown. The invention
employs a unique combination of random questions,
automatic speech recognition (ASR) and text-independent
speaker recognition to provide a significant improvement in
secure access to services and/or facilities (as discussed
previously) requiring security measures. Specifically, a user
(block 12) requesting access to a service/facility is subjected
to a security system 10 employing a combination of random
questions, ASR and text-independent speaker recognition
(block 14) via an iterative process (loop 16) whereby the
security system 10 utilizes user databases of non-acoustic
information (block 18) and/or an acoustic user model (block
20) to perform the verification/identification of the user 12.
These components and their specific interaction will be
explained below in the context of the remaining figures.

It is to be understood that the components described
herein in accordance with the invention may be implemented
in hardware, software, or a combination thereof.
Preferably, the invention is implemented in software in the
form of functional software modules on an appropriately
programmed general purpose digital computer or computers.
The actual location of the computer system implementing
the invention is not critical to the invention; however, in an
application where the user is requesting remote access via
telephone, the invention or portions thereof may reside at the
service/facility location or some location remote thereto.
Further, the invention may be implemented in an internet
environment in which case various portions of the invention
may reside at the user's location and/or the service provider's
location.

Referring now to FIG. 2, one embodiment of the
invention, illustrated via a flow chart/block diagram, is
shown. It is to be understood that same or similar components
illustrated throughout the figures are designated with
the same reference numeral. A potential user 12 of the
service/facility performs the following operation in cooperation
with security system 10 in order to gain access to the
service/facility. The user 12 calls a central server 22 via link
24. The user 12 identifies himself via his name, for example,
and requests access to the service/facility. The central server
22 then performs the following operations. The server 22
submits the utterance of the user's name and request to
automatic speech recognizer (ASR) 28 via link 26 which
decodes the utterance and submits the decoded name and
request back to server 22. It is to be appreciated that, while
preferable, the name of the speaker is not mandatory in
establishing the identity claim. The identity claim may be
made from other information provided by the speaker or
voice characteristics, as explained herein. Also, the identity
claim may be established by the user keying in or using a
magnetic strip card to provide an identification number. The
server 22 then accesses a database (which is part of the user
databases 18) via link 30 corresponding to the user
(candidate) identified during the identification claim. As will
be explained, the user database contains information specific
to that particular user. Also, an acoustic model 20 pertaining
to that user (as will be explained) is selected from the user's
database through link 32 and provided to the central server
22 via link 34.

Next, utilizing the specific information from the identified
user's database, the server 22 generates a random question
(or multiple random questions) for the user via link 36. The
user answers the random question(s) which is sent back to
the server 22 via link 38. It should be understood that links
24, 36 and 38 are preferably provided over a single
communication path which may be hardwired (e.g. PSTN) or
wireless (e.g. cellular). The separation of links is meant to
illustrate functionality rather than physical implementation.

The central server 22 receives the user's answer and
processes it through ASR 28. After decoding the answer,
ASR 28 passes the decoded answer to a semantic analyzer
40 via link 42. The semantic analyzer 40 analyzes the answer
to determine if the answer is correct, or not, in accordance
with the information in the user's database. The result of the
semantic analyzer 40 is sent to a score estimator 44 via link
45 where a partial score associated with the answer received
from the user is generated. It should be understood that the
lack of a "perfect" partial score does not necessarily indicate
an incorrect answer from the user due to the fact that known
speech recognition processes, such as employed by ASR 28,
have acceptable recognition error rates associated therewith
and, thus, while the actual answer is correct, the decoded
answer may be close enough to satisfy the semantic analyzer
40. Also, it is to be understood that some speech recognition
and natural language understanding techniques may have
recognition and/or understanding errors associated therewith
such that, as a result, they do not correctly recognize and/or
understand the answer provided by the speaker. Hence, in
such cases, it is preferred that more than one random
question be asked prior to making a decision to permit or
deny access to the speaker. Links 46, 48 and 50 from the
score estimator 44 go back to the central server 22 to
indicate whether the answer was correct, not correct, or for
some reason (e.g., ASR could not properly decode the
response), the answer was not understood and the answer
should be repeated by the user 12. The question and answer
process between the user 12 and the central server 22 may
continue for as many iterations as are desired to substantially
ensure that the potential user is the user associated with the
subject user database.

Substantially simultaneous with the spoken utterances
provided by the user 12, the central server 22 may process
a user voice sample, for example, from the initial dialog
from the potential user and/or from the answer or answers
uttered by the potential user, through a text-independent
speaker recognition module 52 via link 54 in order to verify
(confirm) the user's identity. The module 52 utilizes the user
model previously built (as will be explained in the context
of FIG. 5) and which is presented to module 52 from user
model 20 via link 51. It is to be appreciated that such speaker
recognition process of module 52 is preferably implemented
by default, regardless of the partial score(s) achieved in the
question/answer phase, in order to provide an additional
measure of security with regard to service/facility access.
The voice sample is processed against the user model and
another partial score is generated by the score estimator 44.
Based on a comparison of a combination of the partial scores
(from the question/answer phase and the background
speaker verification provided by module 52) versus a
predetermined threshold value, the central server 22 decides
whether or not to permit access to the service/facility to the
user 12. If the combined score is above or within an
acceptable predetermined range of the threshold value, the
central server 22 may permit access, else the server may
decide to deny access completely or merely repeat the
process. Further, a service provider may decide to take other
appropriate security actions.

The user model 20 may also be operatively coupled to the
central server 22 (link 34), the semantic analyzer 40 (link
53), the score estimator 44 (link 56) and the ASR 28 (link
55) in order to provide a data path therebetween for the
processes, to be further explained, performed by each
module. Link 58 between the text-independent speaker recognition
module 52 and the score estimator 44 is preferably
provided in order to permit the module 52 to report the
results of the voice sample/model comparison to the
estimator 44.

Also, it is to be understood that because the components
of the invention described herein are preferably implemented
as software modules, the actual links shown in the
figures may differ depending on the manner in which the
invention is programmed.

It is to be appreciated that portions of the information in
each database and the user models may be built by a
pre-enrollment process. This may be accomplished in a
variety of ways. The speaker may call into the system and,
after making an identity claim, the system asks questions
and uses the answers to build acoustic and non-acoustic
models and to improve the models throughout the entire
interaction and during future interactions. Also, the user may
provide information in advance (pre-enrollment) through
processes such as mailing back a completed informational
form with similar questions as asked during enrollment over
the phone. Then, an operator manually inputs the information
specific to the user into the system. Alternatively, the
user may interact with a human operator who asks questions
and then inputs answers to questions into the system. Still
further, the user may complete a web (internet) question/
answer form, or use e-mail, or answer questions from an
IVR (Integrated Voice Response) system. Also, it is to be
appreciated that the questions may preferably be relatively
simple (e.g., what is your favorite color?) or more complex,
depending on the application. The simpler the question, the
more likely it is that the actual user will not be denied access
simply because he forgot his answers.

Furthermore, because of the fact that text-independent
speaker recognition is performed in the background and the
questions are random in nature and the entire process is
monitored throughout the dialog, even frauders using tape
recorders or speech synthesizers cannot fool the system of
the invention as such fraudulent processes cannot handle the
random questions and/or the dialog in real-time. Still further,
for a frauder to know the answers to all the questions will not
help gain access to the service or facility, since if the frauder
has a different speech rate or voice print, for example, then
the identification and verification claim will fail. For these
reasons, the actual user is encouraged to answer with a
natural full sentence, that is, to establish a full dialog.

It is further to be understood that the system of the
invention is capable of building more questions, either by
learning about the user or, after identifying the user, asking
new questions and using the answers (which are transcribed
and understood) as the expected answers to future random
questions.

Accordingly, it is to be appreciated that the invention can
build databases and models both automatically and manually.
Automatic enrollment is performed by obtaining the
name, address and whatever other identification tag that the
service/facility desires and then building, from scratch,
models and compiling data usable for future identification or
verification. Beyond the ability to self-enroll users, the
system of the invention provides the ability to automatically
adapt, improve or modify its authentication processes. Still
further, the automatic nature of the invention permits the
building of a user profile for any purpose including the
possibility of having other self-enrolling, self-validating
and/or self-updating biometrics (e.g., face patterns for face
recognition, iris recognition, etc.). Thus, it is possible to
combine biometrics (speech, voiceprint) in order to have
self-enrolling biometrics. Self-validation is also provided
such that whenever a score associated with the biometric
match is poor, the present invention may be used to still
admit the person but also to correct the models on the
assumption that they are outdated.

It is to be appreciated that several variations to the
above-described security access process are possible. For
example, if a caller calls the central server 22 for the first
time and the system has a database of information pertaining
to the caller but does not have an acoustic model set up for
that caller, the following procedure may be performed. The
central server 22 asks a plurality of questions from the
database, the number of questions depending upon the
known average error rate associated with ASR 28 and
semantic analyzer 40. Then, based only on the scores
achieved by the answers received to the questions, the server
22 makes a determination whether or not to permit access to
the caller. However, the system collects voice samples from
the caller's answers to the plurality of questions and builds
a user voice model (e.g., user model 20) therefrom.
Accordingly, the next time the caller calls, the server 22 need
ask only a few random questions from the database and, in
addition, use the text-independent speaker recognition module
52 along with the new user model to verify his identity,
as explained above.

Many ways for communicating the random questions to
the user may be envisioned by one of ordinary skill in the art.
For instance, if the user is attempting to access the service/
facility through a web page, the questions may be presented
in text form. If access is attempted over a telephone line, the
questions may be asked via a voice synthesizer, a
pre-recorded tape or a human operator. The actual method of
asking the questions is not critical to the invention.
Alternatively, it is to be appreciated that at least a portion of
the answers provided by the potential user may be in a form
other than speech, i.e., text format, keyed-in information,
etc.
A further variation to the above-described system includes
an embodiment wherein the inventive security system is
implemented in a user's personal computer (at his home or
office) to which the user seeks access. In such a scenario, a
module substantially equivalent to the central server module
may use such information as the last time the user received
a facsimile on the computer, etc., to decide whether or not
to allow access. Specifically, an ASR/semantic analyzer
and/or a speaker recognition module, such as those discussed
above, may be implemented in the user's computer
to perform the verification process discussed herein. One of
ordinary skill in the art will appreciate further variations to
the above-described embodiments given the inventive teachings
disclosed herein.

Referring now to FIG. 3, another embodiment of the
invention, illustrated via a flow chart/block diagram, is
shown. Such an embodiment provides a security system for
processing a large number of users who are attempting to
access a service/facility simultaneously or for processing a
large user database. Specifically, the embodiment described
below provides identification in a verification process with
respect to a speaker via an iterative procedure that reduces
a set of candidate speakers at each step via random questioning
and voice identification on such set of candidate
speakers until one candidate or no candidates are left.

Again, a potential user 12 calls a central server 22 via link
24, identifies himself and requests access to the particular
service/facility. Next, the server 22 provides the name of the
user to a speaker-independent ASR 60 via link 62. In
response, the ASR 60 decodes the caller's name and provides
a list of candidate names that match the caller's
acoustic utterance to the server 22 via link 64. However, as
mentioned above, even if the speaker doesn't provide his/her
name, the system may use other information and voice
characteristics to generate the list of candidates.

Next, personal databases (block 66) of users with names
from the list are activated. These databases are activated
through the larger set of databases 18 of all the service/
facility users via links 63 and 65. The selected databases
contain personal user data, such as age, profession, family
status, etc., as well as information about the user's voice,
such as prototypes, prosody, speech rate, accent, etc. The
types of information in the databases will be described later
in greater detail.

Still further, a voice classification module 68 is accessed
via link 69. The module 68, which performs a voice classification
analysis, checks for certain voice characteristics of
the caller and browses the selected databases 66 via link 67
to eliminate users who do not fit these characteristics, thus
narrowing the list of possible candidates. Next, a random
question relevant to the user databases that remain as
candidates after the voice classification analysis is presented
to the user via link 70. The user provides his answer to the
question via link 72 and the server 22, via the ASR 60, uses
the answer to eliminate more of the candidates. Further, the
server 22 uses the user's voice sample from the answer to
run a more precise voice classification analysis via module
68 to reduce the list of candidates even further. This procedure
continues iteratively with more random questions and
with more detailed levels of speaker classification analysis
until one or none of the candidates remain. As mentioned,
the use of random questions in an iterative process makes the
fraudulent use of recorders or synthesizers to fool the system
substantially useless. Also, the use of a relatively large
quantity of random questions overcomes the known problem
of speech recognition and natural language understanding
techniques making recognition and understanding errors.

A variation to the above embodiment includes the scenario
wherein the user 12 provides a password to the server
22 which he shares with a group of other users. In this way,
the server 22 initially narrows the possible candidate databases
from the set of databases 18 before asking any
questions or accessing the voice classification module 68.
Once the list of user databases is limited by the password,
then the above iterative process may be performed to
identify one of the candidates or to exclude all candidates.

As mentioned previously, while multiple links between
modules are illustrated in the figures, they are meant to
illustrate functional interaction between modules rather than
actual physical interconnection, although a physical implementation
similar thereto may be employed. Also, one of
ordinary skill in the art will appreciate further variations to
the above-described embodiments given the inventive teachings
disclosed herein.

Referring now to FIG. 4, a block diagram illustrating the
possible types of information contained in a user database 18
is shown. The use of such non-acoustic information, as
previously explained, significantly improves the performance
of the security measures described with respect to the
invention. In addition, a variety of acoustic information may
be included in the databases.

The information may be categorized as information
exhibiting static features, i.e. information that does not
change or changes slowly or periodically with time (block
102), and information exhibiting dynamic features, i.e.,
information that changes quickly or non-periodically with
time (block 104). In other words, static information is a
function of the caller/user and dynamic information is a
function of the request. Static information may be either
internal (block 106) or external (block 108). Examples of
external information are phone numbers, time of day, etc.
Internal information may be further categorized as information
extracted from the dialog between the user and the
server, such as gender, speech rate, accent, etc., or information
decoded from the dialog by the ASR, such as name,
address, date of birth, family status, etc. On the other hand,
dynamic information may include information regarding the
caller's trips, meetings with people, facsimile and e-mail
information. For instance, if the system of the invention is
implemented on the user's computer, as previously
mentioned, then the system may query the user who is
seeking remote access thereto by asking whether the user
received e-mail from a particular person on a particular day.
It is to be appreciated that the present invention can dynamically
create new questions (from information provided in
real-time), understand the respective answers and then use
the information during the next transaction. Automatic
enrollment of a new user may also be accomplished in a
similar manner.

Referring now to FIG. 5, a flow chart/block diagram
illustrating the generation of user model 20, formed according
to the invention, is shown. As previously explained, a
user model is employed to estimate a probability of a
particular user's identity. The model is utilized in accordance
with the text-independent speaker recognition module
52 (FIG. 2) and the voice classification module 68 (FIG. 3).
It is to be appreciated that the static and dynamic information
utilized in the models is usually distinct from any other
information provided and utilized in response to the random
questions, but this is not a necessary condition.

The user information that was described with respect to
FIG. 4 may be advantageously used to generate a model of
users in order to enhance the text-independent speaker
recognition process performed by module 52 (FIG. 2) and 
the voice classification process performed by module 68 
(FIG. 3). It is to be understood that such a model does not 
produce a user's acoustic score but rather estimates a prob- 
ability of a given user's identity from a known user's 
database. In one particular form, one can interpret a search 
of a best matching speaker with respect to acoustic data as 
a maximum value of a function defined as the conditional 
probability of a speaker (speaker_i) given the acoustic data
utilized from the speaker dialog, i.e., P(speaker_i | acoustic
data). It is generally known in speech recognition that such
a conditional probability may be computed by converting
such equation to P(acoustic data | speaker_i) P(speaker_i),
up to the factor P(acoustic data), which is the same for
every speaker. This
general speech recognition equation is designated as block
202 in FIG. 5. It is further to be understood that P(acoustic
data | speaker_i) may be computed using some acoustic models
for speakers that may be represented as Hidden Markov
Models (HMM). In another embodiment, one can interpret
P(speaker_i) as a weighted factor and update a general
speaker score using a known formula. 
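
As an illustration only, the conversion described above can be sketched in a few lines of Python. The function name, the speaker labels and all numeric values are hypothetical and do not come from the specification; the sketch assumes that an acoustic model has already produced a log-likelihood log P(acoustic data | speaker_i) for each candidate and that a strictly positive prior P(speaker_i) is available from the user model.

```python
import math

def posterior_scores(acoustic_loglik, prior):
    """Return normalised P(speaker_i | acoustic data) for each candidate.

    acoustic_loglik: log P(acoustic data | speaker_i) per speaker (assumed given)
    prior:           P(speaker_i) per speaker, must be > 0 (assumed given)
    """
    # Unnormalised log posterior: log P(data | speaker_i) + log P(speaker_i)
    log_post = {s: acoustic_loglik[s] + math.log(prior[s]) for s in acoustic_loglik}
    # Normalise with log-sum-exp so that very small likelihoods do not underflow
    top = max(log_post.values())
    log_norm = top + math.log(sum(math.exp(v - top) for v in log_post.values()))
    return {s: math.exp(v - log_norm) for s, v in log_post.items()}

# Hypothetical values: speaker_1 fits the audio slightly better,
# but the user model makes speaker_2 far more likely a priori.
acoustic_loglik = {"speaker_1": -120.0, "speaker_2": -121.0}
prior = {"speaker_1": 0.2, "speaker_2": 0.8}

posterior = posterior_scores(acoustic_loglik, prior)
best = max(posterior, key=posterior.get)
print(best, round(posterior[best], 3))
```

In this toy case the prior outweighs the small acoustic advantage of speaker_1, which is the sense in which P(speaker_i) acts as a weighting factor on the acoustic score.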

As long as there is a satisfactory U-measure (block 204) 
on a user database, one can apply the strategy used in speech 
recognition to reduce the size of short lists of speakers and 
exclude speakers from the acoustic recognition process as 
long as their U-measure is below some predetermined 
threshold value. U-measure calculation is a statistical mea- 
sure on a set of users. The term "U-measure" simply refers 
to a user measure or a measure on a user population. The 
measure may be any standard statistical measure on some set 
of events. In the context of the invention, the events are that 
some users from a set of users will try to access a service 
and/or facility. The measure on the set is used to derive a 
probability that some event or set of events may occur. A 
standard reference of probabilistic measure on some set may 
be found in the reference: Encyclopedia of Mathematics, 
Vol. 6, Kluwer Academic Publishers, edited by M. 
Hazewinkel, London (1990). 

In order to estimate the U-measure for all users in the
database, one can use one of the following procedures. First,
one may introduce some static parameters (features) that
characterize system users, i.e. profession, sex, hobby, etc.,
and denote them as S_1, S_2, S_3, ..., S_j. Likewise, dynamic
parameters (features) may be introduced, i.e. age, time when
a person attempts to access the service/facility, location from
which the caller is calling, etc. and denote them as D_1, D_2,
D_3, ..., D_k. Now, one can estimate P(S_j | D_k) from training
samples (block 208b) for some users within the overall user
database (208). The estimation of the static parameters S_j
and dynamic parameters D_k is respectively done in blocks
210 and 212. Then, for any new user (block 208a), one can
estimate his U-score, such as the product of all P(S_j | D_k)
where S_j are taken from the new user's database.
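
A minimal sketch of one way to read this procedure follows, in Python. It is an interpretation rather than the method of the specification: the conditional probabilities are estimated by counting co-occurrences in training samples, add-one smoothing is an added assumption to avoid zero probabilities, and the feature names, user names and threshold are hypothetical. The last step applies the short-list pruning described earlier, dropping users whose score falls below a threshold.

```python
from collections import Counter
from itertools import product

def estimate_conditionals(samples, alpha=1.0):
    """Estimate P(S_j = s | D_k = d) by counting over (static, dynamic) samples."""
    joint = Counter()   # counts of (S_j, s, D_k, d)
    marg = Counter()    # counts of (S_j, D_k, d)
    values = {}         # observed values of each static feature
    for static, dynamic in samples:
        for (sj, s), (dk, d) in product(static.items(), dynamic.items()):
            joint[(sj, s, dk, d)] += 1
            marg[(sj, dk, d)] += 1
        for sj, s in static.items():
            values.setdefault(sj, set()).add(s)

    def prob(sj, s, dk, d):
        # Add-alpha smoothing (an assumption, not taken from the text)
        n_values = len(values.get(sj, set()) | {s})
        return (joint[(sj, s, dk, d)] + alpha) / (marg[(sj, dk, d)] + alpha * n_values)

    return prob

def u_score(prob, static, dynamic):
    """Product of P(S_j | D_k) over the user's static and the request's dynamic features."""
    score = 1.0
    for (sj, s), (dk, d) in product(static.items(), dynamic.items()):
        score *= prob(sj, s, dk, d)
    return score

# Hypothetical training samples: (static features, dynamic features)
samples = [
    ({"profession": "nurse"}, {"hour": "night"}),
    ({"profession": "nurse"}, {"hour": "night"}),
    ({"profession": "clerk"}, {"hour": "day"}),
    ({"profession": "clerk"}, {"hour": "night"}),
]
prob = estimate_conditionals(samples)

# Score each enrolled user for a request arriving at night and prune the list.
users = {"user_a": {"profession": "nurse"}, "user_b": {"profession": "clerk"}}
request = {"hour": "night"}
threshold = 0.5
shortlist = [u for u, st in users.items() if u_score(prob, st, request) >= threshold]
print(shortlist)
```

Only the users that survive this pruning would then be passed to the acoustic recognition process.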

Additional special parameters can be introduced in the set 
S_j. Such parameters may be vocabulary, prosody, speech
rate, accent, etc. Essentially, this data can be obtained from
the acoustic front-end of the automatic speech recognizer
prior to speech recognition. 

Other measures which represent the probability of a 
speaker given a parameter, including more complex models, 
may be employed and, as a result, the invention is not 
limited to the use of a U-measure of the static and dynamic 
parameters described herein. 

In addition, the speaker identity claim can be done automatically
via a speech and speaker recognition process. At
the beginning of the conversation between the user and the
central server, the speaker will provide his name and some
information such as his address or the object of his request.
An acoustic front-end in the speech recognizer (such as ASR
28 in FIG. 2) extracts the acoustic features associated with
these first spoken utterances. The utterances are recognized
and processed by a natural language understanding module
within the speech recognizer in order to identify the name of
the user and the address, if available. The stream of acoustic
features is also fed to a text-independent speaker recognition
module (such as module 52 in FIG. 2) which provides
the system with a list of potential users. By searching for
matches between the recognized name and the top-ranked
identified speakers, the identity claim is obtained.
Alternatively, the recognition of the name reduces the population
of candidates to a subset, namely, the set of speakers
with the same name. Then, the speaker identification process
establishes the correct identity claim. Similarly, the
requested service may be automatically recognized and
different levels of security or tolerance may be established
based on the type of request. Also, both approaches may be
combined. The system may recognize the name and address
of the speaker; however, recognition and/or understanding
errors may occur and/or a list of speakers with the same
name/address may exist. Thus, by using speaker recognition
(preferably, text-independent) on the same data, the list of
candidates may be reduced to a small number or to one
particular user. If a substantial set of candidates still exists,
random questions may be used to decide who the speaker is
before going through the verification process.
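
The combination of name recognition, speaker ranking and random questions described above can be sketched as follows. This is a simplified illustration in Python under stated assumptions: the directory layout, the user identifiers, the `ask` callback and both function names are hypothetical, the speech recognizer and speaker recognition module are replaced by their already-computed outputs, and answers are compared by exact string match.

```python
import random

def identity_candidates(recognized_name, ranked_speakers, directory, top_n=3):
    """Keep the top-ranked identified speakers whose stored name matches."""
    same_name = {uid for uid, rec in directory.items()
                 if rec["name"] == recognized_name}
    return [uid for uid in ranked_speakers[:top_n] if uid in same_name]

def narrow_by_questions(candidates, directory, ask, rng, max_questions=5):
    """Ask random questions until at most one candidate remains."""
    for _ in range(max_questions):
        if len(candidates) <= 1:
            break
        # Pick a random fact known for at least one remaining candidate
        fields = sorted({f for uid in candidates for f in directory[uid]["facts"]})
        field = rng.choice(fields)
        answer = ask(field)
        # Eliminate candidates whose stored answer does not match
        candidates = [uid for uid in candidates
                      if directory[uid]["facts"].get(field) == answer]
    return candidates

# Hypothetical user directory
directory = {
    "u1": {"name": "John Smith", "facts": {"favorite color": "blue", "city": "Austin"}},
    "u2": {"name": "John Smith", "facts": {"favorite color": "green", "city": "Boston"}},
    "u3": {"name": "Jane Roe", "facts": {"favorite color": "blue", "city": "Austin"}},
}

# Stand-ins for the recognizer outputs: decoded name and ranked speaker list
candidates = identity_candidates("John Smith", ["u3", "u2", "u1"], directory)

# The caller is u1 and answers truthfully (simulated dialog)
ask = lambda field: directory["u1"]["facts"][field]
remaining = narrow_by_questions(candidates, directory, ask, random.Random(0))
print(candidates, remaining)
```

Here the name match leaves two candidates, and a single random question is enough to separate them; an empty result would mean that no identity claim could be established.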

The advantage of having a text-independent speaker recognition
engine is apparent when the actual service is
provided. The stream of acoustic features fed to the speech
recognition engine and its natural language understanding
module may also be fed to the text-independent speaker
recognition module which runs in the background and
verifies, over the whole interaction and using a large amount
of test data, that the speaker verification still matches.
Advantageously, problems can be flagged and depending on
the service, the service may be interrupted or an operator
may be called or a subsequent verification may be requested
whereby the transaction is temporarily put on hold until the
re-verification is accomplished.

The following is one example of an implementation of the
speaker verification principles described herein. However, it
is to be appreciated that the present invention is not limited
to this particular example and that one of ordinary skill in the
art will contemplate many other implementations given the
teachings described herein.

The feature vectors (obtained as an output of the acoustic
front-end of the speech recognizer) are of the mel cepstral,
delta and delta-delta type (including C0 energy). These
feature vectors are 39 dimension vectors and are computed
on frames of about 25 milliseconds with shifts of about 10
milliseconds. It is to be appreciated that the speaker recognition
module and the speech recognizer may use the same
types of feature vectors; however, this is not critical to the
invention. The speaker identifier is a vector quantizer which
stores, during enrollment, a minimum of information about
each speaker. All the input feature vectors are clustered in a
set of about 65 codewords. Typically, about 10 seconds of
speech are required for enrollment. This is easily obtained as
the new user enrolls all of his aliases. However, when the
user interacts with the system for purposes other than
enrolling his aliases, the data obtained may also be used to build
acoustic models of the voice prints. In practice, all the data
from the user enrollment is used. Note that when a new
speaker is enrolled, it does not affect any of the previous
models.
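
A toy sketch of such a vector-quantizer speaker identifier is given below in Python, using only the standard library. It is not the implementation of the specification: plain k-means is assumed as the clustering method, the frames are 2-dimensional instead of 39-dimensional, each codebook has 2 codewords instead of about 65, and all names and data are hypothetical. Enrollment builds one codebook per speaker independently, so adding a speaker leaves the earlier models untouched; identification lets each frame vote for the codebook holding its nearest codeword.

```python
import math
from collections import Counter

def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def train_codebook(frames, n_codewords, n_iter=10):
    """Cluster one speaker's enrollment frames into codewords (plain k-means)."""
    # Deterministic initialisation from evenly spaced frames (an assumption)
    step = max(1, len(frames) // n_codewords)
    centroids = [list(f) for f in frames[::step][:n_codewords]]
    for _ in range(n_iter):
        buckets = [[] for _ in centroids]
        for f in frames:
            nearest = min(range(len(centroids)),
                          key=lambda i: sq_dist(f, centroids[i]))
            buckets[nearest].append(f)
        for i, bucket in enumerate(buckets):
            if bucket:  # an empty cluster keeps its previous centroid
                centroids[i] = [sum(col) / len(bucket) for col in zip(*bucket)]
    return centroids

def identify(frames, codebooks):
    """Return (most voted speaker, vote histogram, average distance to the closest codeword)."""
    votes = Counter()
    total = 0.0
    for f in frames:
        best_name, best_d = None, math.inf
        for name, codebook in codebooks.items():
            d = min(sq_dist(f, c) for c in codebook)
            if d < best_d:
                best_name, best_d = name, d
        votes[best_name] += 1
        total += math.sqrt(best_d)
    return votes.most_common(1)[0][0], votes, total / len(frames)

# Hypothetical enrollment data: one codebook per speaker, trained independently
codebooks = {
    "alice": train_codebook([(0.0, 0.0), (0.2, 0.1), (1.0, 1.0), (1.1, 0.9)], 2),
    "bob": train_codebook([(10.0, 10.0), (10.2, 9.9), (12.0, 12.0), (11.9, 12.1)], 2),
}

winner, votes, avg_dist = identify([(0.1, 0.0), (0.9, 1.0), (1.0, 1.2)], codebooks)
_, _, stranger_dist = identify([(50.0, 50.0)], codebooks)
print(winner, dict(votes), round(avg_dist, 3), round(stranger_dist, 1))
```

A large average distance, as for the last test frame, suggests that the voice matches none of the enrolled codebooks, which is one way a system of this kind could decide to offer enrollment to a new user.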
When an enrolled speaker uses the system, the acoustic
features are computed and simultaneously given to the
speaker identification system and to the speech recognizer.
The speaker identification/verification/classification phase is
implemented as a vector quantizer decoder. On a frame by
frame basis, it identifies the closest codebook (or ranks the
N closest codebooks). A histogram is created which counts
how many frames have been selected for each codebook.
The codebook which is most often selected identifies the
potential speaker. By looking at the average distance from
the closest codebook, it is possible to detect new users. In
this case, the new user is then prompted with an enrollment
menu in order to perform the enrollment process.

Different embodiments may be employed for the text-independent
speaker verifier. Again, the feature vectors
(obtained as the output of the acoustic front-end) are of the
mel cepstral, delta, and delta-delta type (including C0
energy). They are preferably 39 dimension vectors and are
computed at the user end or at the server end. The features
are usually computed on frames of about 25 milliseconds
with shifts of about 10 milliseconds. The speaker verifier is
preferably a vector quantizer which stores, during
enrollment, a minimum of information about each speaker.
All the input feature vectors are clustered in a set of about
65 codewords (centroids and variances). Typically, about 10
seconds of speech are required for enrollment. The speaker
verification phase is implemented as a vector quantizer
decoder. On a frame by frame basis, the closest codebook is
identified or the N closest codebooks are ranked. A histogram
is created which counts how many frames have
selected each codebook. The codebook which is most
selected identifies the potential speaker. Acceptance or rejection
of test speakers is based on the average distance from
the codebooks of the testing vectors versus the average
variance of the codebook provided that the identified
speaker matches the identity claim and by comparing the
scores to the scores obtained from "cohorts" of the speaker
as described in U.S. Ser. No. 08/788,471. Cohorts are sets of
similarly sounding speakers who are in the database. The
verification results from a competition between the speaker
model and the models of the cohorts or background (new
model built over the whole cohort group) models. The
identity claim is tried over all the users who have access to
the function protected by the system. The speaker classification
is performed by doing identification with models
obtained by clustering close codebooks associated with
similar speakers.

It is to be appreciated that, in order to implement the
embodiments described herein, various existing components
may be implemented. For instance, a speech recognizer
(such as shown as ASR 28 in FIG. 2 and speaker-independent
ASR 60 in FIG. 3) may be implemented using
a classical large vocabulary speech recognition engine using
Hidden Markov Models, mixtures of Gaussian probabilities,
mel cepstral vectors as acoustic features, a 20K or 64K
vocabulary and a trigram language model. Such systems, for
example, are disclosed in IBM's speech dictation engine
"Simply Speaking" and in "Transcription of Radio Broadcast
News with the IBM Large Vocabulary Speech Recognition
System", P. S. Gopalakrishnan et al., Proceeding
Speech Recognition Workshop, Arden House, DARPA (1996).
Further, while it is preferred that a text-independent speaker
recognition module (module 52 of FIG. 2) is utilized,
text-dependent or text-prompted speaker recognition modules
may also be used. Such systems are described in the
Furui article (text-dependent speaker recognition) and in
U.S. Ser. No. 08/788,471 (text-independent speaker
recognition). Preferably, a natural language understanding
module which relies on a combination of classical parsing
capabilities with key word spotting and speech recognition
may be employed, such as disclosed in the article "Statistical
Natural Language Understanding Using Hidden
Clumpings", M. Epstein et al., ICASSP Proceedings, pg.
176, Vol. 1, (1996). However, other methods of implementing
natural language understanding may be employed.

Furthermore, a voice classification module such as is
disclosed in either U.S. Ser. No. 08/787,031 or the patent
application (U.S. Ser. No. 08/851,982) entitled, "Speaker
Recognition Over Large Population with Combined Fast and
Detailed Matches", filed on May 6, 1997, may be employed
to perform the functions of the voice classification module
68. A score estimator such as is disclosed in U.S. Pat. No.
5,502,774 (Bellegarda et al.) may be employed to perform
the functions of the score estimator 44. A semantic analyzer
as is disclosed in either of the articles: G. Gazdar and C.
Mellish, Natural Language Processing in PROLOG: An
Introduction to Computational Linguistics (1989); P. Jacobs
and L. Rau, Innovation in Text Interpretation, Artificial
Intelligence, Vol. 63, pg. 143-191 (1993); W. Zadrozny et
al., "Natural Language Understanding with a Grammar of
Constructions", Proceedings of the International Conference
on Computational Linguistics (August 1994),

Although the illustrative embodiments of the present
invention have been described herein with reference to the
accompanying drawings, it is to be understood that the
invention is not limited to those precise embodiments, and
that various other changes and modifications may be
effected therein by one skilled in the art without departing
from the scope and spirit of the invention.

What is claimed is:

1. A method of controlling access of a speaker to one of
a service and a facility, the method comprising the steps of:

(a) receiving first spoken utterances of the speaker, the
first spoken utterances containing indicia of the
speaker;

(b) decoding the first spoken utterances;

(c) accessing a database corresponding to the decoded
first spoken utterances, the database containing information
attributable to a speaker candidate having indicia
substantially similar to the speaker;

(d) querying the speaker with at least one question based
on the information contained in the accessed database;

(e) receiving second spoken utterances of the speaker, the
second spoken utterances being representative of at
least one answer to the at least one question;

(f) decoding the second spoken utterances;

(g) verifying the accuracy of the decoded answer against
the information contained in the accessed database
serving as the basis for the question;

(h) taking a voice sample from the utterances of the
speaker and processing the voice sample against an
acoustic model attributable to the speaker candidate
without requiring dependency on the decoded first and
second spoken utterances;

(i) generating a score corresponding to the accuracy of the
decoded answer and the closeness of the match
between the voice sample and the model; and

(j) comparing the score to a predetermined threshold
value and if the score is one of substantially equivalent
to and above the threshold value, then permitting
speaker access to one of the service and the facility.

2. The method of claim 1, further comprising the step of
one of denying access to the speaker and repeating the
process if the score is not substantially equivalent to and not
above the threshold value.

3. The method of claim 1, wherein the acoustic model 
attributable to the speaker is not previously available and the 
method further comprising the steps of:

(a) querying the speaker with a plurality of questions 
based on the information contained in the accessed 
database; and 

(b) building the acoustic model from voice samples taken 
from a plurality of answers provided by the speaker in
response to the plurality of questions. 

4. The method of claim 1, wherein the indicia in the first 
spoken utterances includes identifying indicia. 

5. The method of claim 1, wherein the indicia in the first 
spoken utterances includes verifying indicia.

6. The method of claim 1, wherein at least a portion of the 
information contained in the database is one of acoustic and 
non-acoustic information. 

7. The method of claim 1, wherein at least a portion of the 
information contained in the database is derived from spoken
utterances provided by the speaker prior to the decoding
step. 

8. The method of claim 1, wherein at least a portion of the 
information contained in the database is derived from 
decoded spoken utterances provided by the speaker. 

9. The method of claim 1, wherein at least a portion of the 
information in the database has static features. 

10. The method of claim 1, wherein at least a portion of 
the information in the database has dynamic features.

11. The method of claim 1, wherein the sub-step of 
processing the voice sample against the acoustic model is 
performed by a text-independent speaker recognition tech- 
nique. 

12. The method of claim 1, further comprising the step of 
requerying the at least one question if the at least one answer 
is not accepted during the decoding step. 

13. The method of claim 1, wherein the steps of the 
method of controlling access of the speaker forms a speech 
biometric. 

14. The method of claim 1, wherein one of the database
and the model may be built through pre-enrollment of the
speaker. 

15. The method of claim 1, wherein one of the database 
and the model may be one of built and updated automatically 
during the method of controlling access of the speaker to one
of the service and the facility. 

16. The method of claim 1, wherein at least a portion of 
the at least one answer provided by the speaker is in a form 
other than speech. 

17. The method of claim 1, wherein the sub-step of
processing the voice sample against the acoustic model is 
performed on a random question. 

18. A method of controlling access of a speaker to one of 
a service and a facility from among a multiplicity of speaker 
candidates, the method comprising the steps of:

(a) receiving first spoken utterances of the speaker, the
first spoken utterances containing indicia of the
speaker;

(b) decoding the first spoken utterances;

(c) generating a sub-list of speaker candidates that substantially
match the speaker's decoded spoken utterances;

(d) activating databases respectively corresponding to the
speaker candidates in the sub-list, the databases containing
information respectively attributable to the
speaker candidates; 
(e) performing a voice classification analysis on voice
characteristics of the speaker without requiring depen- 
dency on the decoded first spoken utterance; 

(f) eliminating speaker candidates who do not substan- 
tially match these characteristics; 

(g) querying the speaker with at least one question that is 
relevant to the information in the databases of speaker 
candidates remaining after the step (f); 

(h) further eliminating speaker candidates based on the 
accuracy of the answer provided by the speaker in 
response to the at least one question; 

(i) further performing the voice classification analysis on 
the voice characteristics from the answer provided by 
the speaker without requiring dependency on the 
decoded first spoken utterance; 

(j) still further eliminating speaker candidates who do not 
substantially match these characteristics; and 

(k) iteratively repeating steps (g) through (j) until one of 
one speaker candidate and no speaker candidates 
remain, if one speaker candidate remains then permit- 
ting the speaker access and if no speaker candidate 
remains then denying the speaker access. 

19. The method of claim 18, wherein the first spoken 
utterances of the speaker are also representative of a pass- 
word which is shared by a sub-set of the multiplicity of 
speaker candidates. 

20. The method of claim 18, wherein the indicia in the 
first spoken utterances includes identifying indicia. 

21. The method of claim 18, wherein the indicia in the 
first spoken utterances includes verifying indicia. 

22. The method of claim 18, wherein at least a portion of 
the information contained in the database is one of acoustic 
and non-acoustic information. 

23. The method of claim 18, wherein at least a portion of 
the information contained in the database is derived from 
spoken utterances provided by the speaker prior to the 
decoding step. 

24. The method of claim 18, wherein at least a portion of 
the information contained in the database is derived from 
decoded spoken utterances provided by the speaker. 

25. The method of claim 18, wherein at least a portion of 
the information in the database has static features. 

26. The method of claim 18, wherein at least a portion of 
the information in the database has dynamic features. 

27. The method of claim 18, wherein the steps of per- 
forming voice classification analysis are performed by a 
text-independent voice classification. 

28. The method of claim 18, further comprising the step 
of requerying the at least one question if the at least one 
answer is not accepted during the decoding step.

29. The method of claim 18, wherein the steps of the 
method of controlling access of the speaker forms a speech 
biometric. 

30. The method of claim 18, wherein one of the database 
and the model may be built through pre-enrollment of the 
speaker. 

31. The method of claim 18, wherein one of the database 
and the model may be one of built and updated automatically 
during the method of controlling access of the speaker to one 
of the service and the facility. 

32. The method of claim 18, wherein at least a portion of 
the at least one answer provided by the speaker is in a form
other than speech. 

33. The method of claim 18, wherein the steps of per- 
forming voice classification analysis are performed on a 
random question. 
34. Apparatus for controlling access of a speaker to one of
a service and a facility, the apparatus comprising: 

means for receiving first spoken utterances of the speaker,
the first spoken utterances containing indicia of the 
speaker; 

means for decoding the first spoken utterances; 

means for accessing a database corresponding to the 
decoded first spoken utterances, the database contain- 
ing information attributable to a speaker candidate 
having indicia substantially similar to the speaker; 

means for querying the speaker with at least one question 
based on the information contained in the accessed 
database; 

means for receiving second spoken utterances of the
speaker, the second spoken utterances being represen- 
tative of at least one answer to the at least one question; 

means for decoding the second spoken utterances; 

means for verifying the accuracy of the decoded answer 
against the information contained in the accessed database
serving as the basis for the question;

means for taking a voice sample from the utterances of the 
speaker and processing the voice sample against an 
acoustic model attributable to the speaker candidate 
without requiring dependency on the decoded first and 
second spoken utterances; 

means for generating a score corresponding to the accu- 
racy of the decoded answer and the closeness of the 
match between the voice sample and the model; and 

means for comparing the score to a predetermined thresh- 
old value and if the score is one of substantially 
equivalent to and above the threshold value, then 
permitting speaker access to one of the service and the 
facility. 

35. Apparatus for controlling access of a speaker to one of 
a service and a facility from among a multiplicity of speaker 
candidates, the apparatus comprising: 
means for receiving first spoken utterances of the speaker,
the first spoken utterances containing indicia of the 
speaker; 

means for decoding the first spoken utterances; 

means for generating a sub-list of speaker candidates that 
substantially match the speaker's decoded spoken utterances;

means for activating databases respectively correspond- 
ing to the speaker candidates in the sub-list, the data- 
bases containing information respectively attributable 
to the speaker candidates; 

means for performing a voice classification analysis on 
voice characteristics of the speaker without requiring 
dependency on the decoded first spoken utterance; 

means for eliminating speaker candidates who do not 
substantially match these characteristics; 

means for querying the speaker with at least one question 
to the speaker that is relevant to the information in the 
databases of speaker candidates remaining after elimi- 
nation by the eliminating means; 

means for further eliminating speaker candidates based on 
the accuracy of the answer provided by the speaker in 
response to the at least one question; 

means for further performing the voice classification 
analysis on the voice characteristics from the answer 
provided by the speaker without requiring dependency 
on the decoded first spoken utterance; 

means for still further eliminating speaker candidates who 
do not substantially match these characteristics; and 

means for iteratively repeating the querying and voice 
classification analysis procedures until one of one 
speaker candidate and no speaker candidate remains, if 
one speaker candidate remains then permitting the 
speaker access and if no speaker candidate remains 
then denying the speaker access. 